Data Mining And Data Warehousing
Capacity Planning
Disk Space Requirement of Data Warehouse
Disk space required to set up data warehouse is calculated using size of following tables:
- Fact Table, F
- Dimension Table, D
- Aggregate Table, A
- Index Table for the Fact Table, Fi
- Index Table for the Dimension Table, Di
- Index Table for the Aggregate Table, Ai
- Temporary Space
- Sort Space
| Required Disk space | = F + Fi+A + Ai + D + Di + T + S | |
| = 3F + Fi+ D + Di + T + S | as, F = A = Ai |
| = 3F + 0.3F + D + Di + T + S | as, Index of Fact is approximate 30% of Fact Table |
| = 3F + 0.3F+0.1F+0.1F + T + S | as D near equal to Di and D is 10% of Fact Table |
| Required Disk Space | = 3.5F + T + S | |
| Required Disk Space | <= 3.5F + T | as S is less than T and F |
Temporary Space is required to perform tasks like loading operational data, cleaning and transforming data.
Temporary space requirement is calculated using number of concurrent queries, n, that is allowed to execute and
size of largest partition, P.
Temporary Space, T = (2n+ 1)P
For example, let's suppose we have Fact table of size 30 GB and we have to keep 4 years data live.
Total fact data is 30 * 4 = 120 GB
| Disk Space Required | = 3.5 F + T |
| = 3.5 * 120 + T |
| = 420 + T GB |
| |
| T | = (2n + 1 ) P, let n = 4 and P = 4GB |
| |
| Disk Space Required | = 420 + (2n+1)P |
| =420 +(2 * 4 + 1) * 4 |
| =420 +36 |
| =456 GB |
Required CPU capacity
First, we have to find the time required to access the largest amount of data warehouse data.
Fact table is the largest table in the data warehouse. So, using Fact Table (F), Scan rate of I/O device,
we can calculate the time required to scan the whole Fact table, T = F/S --- (i)
Then we have to get the response time excepted by the users, Tp
Then we calculate required scan rate according to user's response time, Sp = F/Tp
After getting the required scan rate we can know how much CPU will be needed to satisfy that scan rate.